Introduction

This week, we’re going to talk about correlations in R. But just as importantly, we are going to get you started on perhaps the most important skill in psychological research – working with data. Follow along below.

Workspace

First, we start by loading the packages and data we’ll need.

Packages
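Judging by the masking messages below, the session loads plyr and psych before the tidyverse. A sketch, assuming those three packages:

```r
# Load plyr and psych first so that, where function names collide
# (e.g. filter, alpha, summarise), the tidyverse versions mask them --
# exactly the conflicts listed in the messages below.
library(plyr)
library(psych)
library(tidyverse)
```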

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.2     ✔ stringr 1.4.0
## ✔ readr   1.1.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()     masks psych::%+%()
## ✖ ggplot2::alpha()   masks psych::alpha()
## ✖ dplyr::arrange()   masks plyr::arrange()
## ✖ purrr::compact()   masks plyr::compact()
## ✖ dplyr::count()     masks plyr::count()
## ✖ dplyr::failwith()  masks plyr::failwith()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::id()        masks plyr::id()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::mutate()    masks plyr::mutate()
## ✖ dplyr::rename()    masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()

Load Data
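The parsing messages below are produced by readr::read_csv(). A minimal sketch of the same pattern, using inline text in place of the course data file (whose name is not shown here); `ts.RS03.w1` is one of the columns parsed below:

```r
library(readr)

# Inline CSV text (wrapped in I()) stands in for the actual data file.
demo <- read_csv(I("SID,ts.RS03.w1\n1,3.5\n2,4.0\n"))

# spec() shows the full column specification, as the message suggests.
spec(demo)
```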

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   ts.RS03.w1 = col_double(),
##   ts.RS04.w1 = col_double(),
##   ts.RS05.w1 = col_double(),
##   ts.RS06.w1 = col_double(),
##   ts.i1.NOM09.w1 = col_double(),
##   ts.i1.NOM11.w1 = col_double(),
##   ts.i2.NOM06.w1 = col_double(),
##   ts.i2.NOM07.w1 = col_double(),
##   ts.i3.NOM08.w1 = col_double(),
##   ts.i7.NOM06.w1 = col_double(),
##   ts.AGQ03.w1 = col_double(),
##   ts.DD.NQ09.2.w1 = col_double(),
##   ts.DD.NQ09.4.w1 = col_double(),
##   ts.CRT.ss.w1 = col_double()
## )
## See spec(...) for full column specifications.

Preparation and Data Cleaning

Now that the data are loaded, we usually need to make modifications: keeping some columns, removing others, creating composites, applying exclusion criteria, etc.

Descriptives

To do this, it’s usually smart to look at some descriptives of your data, especially if you aren’t very familiar with it.

One useful function for this is the describe() function in the psych package.
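For instance, a sketch using the built-in mtcars data in place of the course data:

```r
library(psych)

# describe() returns n, mean, sd, median, skew, kurtosis, se, etc.
# for every numeric column in the data frame.
describe(mtcars)
```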

Selection and Renaming

When I work with data, I’m usually pulling a few variables from data sets that may contain thousands of variables. I often find the easiest way to do this is to use the codebooks associated with the study to create an Excel sheet that lists, for each item:

  • the original column name of the item in the data
  • the new name I’d like to give it
  • the purpose of the variable (e.g., outcome variable, predictor variable, covariate)
  • whether it needs to be reverse scored
  • the scale participants responded on
  • what values represent missing values (this will vary by data set)

See the example I’ve provided and create your own codebook of variables you find interesting.

Handle Missing Values

We want to make sure R recognizes missing values as missing. Below, you’re going to need to make sure that missing values are specified as NA, not as another value.
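One way to do this is with na_if(). A sketch, where -99 is an assumed missing-value code (real codes vary by data set); the lambda form is the style the warning below recommends in place of funs():

```r
library(dplyr)

# Toy data with -99 coding a missing response (an assumption here).
df <- tibble(x = c(1, -99, 3), y = c(-99, 2, 5))

# na_if() replaces a specific value with NA; mutate_all() applies
# it to every column.
df <- df %>% mutate_all(~ na_if(., -99))
```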

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Recode Variables

Almost always, there are variables that are not coded how we want. It could be an ordinal or nominal scale that has levels we don’t want or that needs to be reverse coded. It could be that we want to collapse variables. Whatever it is, it’s important to do.

## Joining, by = "new_name"
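Reverse scoring, for example, can be done arithmetically. A sketch assuming a hypothetical 1–5 response scale, so we subtract from max + min = 6:

```r
library(dplyr)

# item_r is a hypothetical reverse-keyed item on a 1-5 scale.
df <- tibble(item_r = c(1, 4, 5))

# Subtracting from (max + min) flips the scale: 1 -> 5, 5 -> 1.
df <- df %>% mutate(item = 6 - item_r)
```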

Create composites

Usually, there are variables that we need to combine into composites. It could be a personality scale; it could be the same item answered over time. Whatever it is, you’ll do it a lot, and there are lots of ways to go about it. Below, you should try to do this a couple of different ways and see if you achieve the same results.

## Joining, by = "SID"
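Two ways that should agree, sketched with hypothetical items e1–e3:

```r
library(dplyr)

df <- tibble(e1 = c(4, 2, 5), e2 = c(3, 1, 5), e3 = c(5, 2, 4))

# Way 1: rowMeans() on the selected item columns.
df$comp1 <- rowMeans(select(df, e1:e3), na.rm = TRUE)

# Way 2: rowwise() with mean().
df <- df %>%
  rowwise() %>%
  mutate(comp2 = mean(c(e1, e2, e3), na.rm = TRUE)) %>%
  ungroup()

all.equal(df$comp1, df$comp2)  # the two composites should match
```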

Zero-Order Correlations
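The matrices below can be produced with cor() or, for sample sizes, p values, and confidence intervals, with psych::corr.test(). A sketch on built-in data (mtcars columns stand in for the BFI scales):

```r
library(psych)

vars <- mtcars[, c("mpg", "hp", "wt")]

# Pairwise-complete Pearson correlations, rounded as in the output below.
round(cor(vars, use = "pairwise.complete.obs"), 2)

# corr.test() adds sample size, p values (adjusted for multiple tests
# above the diagonal), and normal-theory confidence intervals.
corr.test(vars)
```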

##       BFI.A BFI.C BFI.E BFI.N BFI.O   Sat
## BFI.A  1.00  0.21  0.13 -0.34  0.07  0.31
## BFI.C  0.21  1.00  0.09 -0.15 -0.06  0.27
## BFI.E  0.13  0.09  1.00 -0.27  0.14  0.27
## BFI.N -0.34 -0.15 -0.27  1.00 -0.08 -0.41
## BFI.O  0.07 -0.06  0.14 -0.08  1.00  0.04
## Sat    0.31  0.27  0.27 -0.41  0.04  1.00
## Call:corr.test(x = pairsW1.tidy %>% select(BFI.A:Sat))
## Correlation matrix 
##       BFI.A BFI.C BFI.E BFI.N BFI.O   Sat
## BFI.A  1.00  0.21  0.13 -0.34  0.07  0.31
## BFI.C  0.21  1.00  0.09 -0.15 -0.06  0.27
## BFI.E  0.13  0.09  1.00 -0.27  0.14  0.27
## BFI.N -0.34 -0.15 -0.27  1.00 -0.08 -0.41
## BFI.O  0.07 -0.06  0.14 -0.08  1.00  0.04
## Sat    0.31  0.27  0.27 -0.41  0.04  1.00
## Sample Size 
## [1] 414
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##       BFI.A BFI.C BFI.E BFI.N BFI.O  Sat
## BFI.A  0.00  0.00  0.04  0.00  0.44 0.00
## BFI.C  0.00  0.00  0.35  0.02  0.44 0.00
## BFI.E  0.01  0.07  0.00  0.00  0.03 0.00
## BFI.N  0.00  0.00  0.00  0.00  0.44 0.00
## BFI.O  0.15  0.20  0.00  0.11  0.00 0.44
## Sat    0.00  0.00  0.00  0.00  0.43 0.00
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## BFI.A-BFI.C      0.12  0.21      0.30  0.00      0.08      0.34
## BFI.A-BFI.E      0.04  0.13      0.23  0.01      0.01      0.26
## BFI.A-BFI.N     -0.43 -0.34     -0.25  0.00     -0.46     -0.21
## BFI.A-BFI.O     -0.02  0.07      0.17  0.15     -0.05      0.19
## BFI.A-Sat        0.22  0.31      0.40  0.00      0.18      0.44
## BFI.C-BFI.E     -0.01  0.09      0.18  0.07     -0.04      0.21
## BFI.C-BFI.N     -0.24 -0.15     -0.05  0.00     -0.28     -0.01
## BFI.C-BFI.O     -0.16 -0.06      0.03  0.20     -0.17      0.05
## BFI.C-Sat        0.18  0.27      0.36  0.00      0.14      0.40
## BFI.E-BFI.N     -0.36 -0.27     -0.18  0.00     -0.40     -0.14
## BFI.E-BFI.O      0.04  0.14      0.23  0.00      0.01      0.27
## BFI.E-Sat        0.18  0.27      0.35  0.00      0.13      0.39
## BFI.N-BFI.O     -0.17 -0.08      0.02  0.11     -0.20      0.04
## BFI.N-Sat       -0.49 -0.41     -0.33  0.00     -0.52     -0.28
## BFI.O-Sat       -0.06  0.04      0.13  0.43     -0.06      0.13

Regression
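The model summarized in the output below is lm(Sat ~ BFI.E, data = pairsW1.tidy), as its Call line shows. A sketch of the same pattern on built-in data, so it runs anywhere:

```r
# Simple regression of one outcome on one predictor;
# mtcars stands in here for pairsW1.tidy.
mod <- lm(mpg ~ wt, data = mtcars)
```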

There are lots of helper functions for regression, like:

summary(), which prints a summary of the results

## 
## Call:
## lm(formula = Sat ~ BFI.E, data = pairsW1.tidy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2532 -1.2982  0.1209  1.3994  4.9508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.62125    0.32608  26.439  < 2e-16 ***
## BFI.E        0.19199    0.03406   5.637 3.21e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.989 on 412 degrees of freedom
## Multiple R-squared:  0.07161,    Adjusted R-squared:  0.06935 
## F-statistic: 31.78 on 1 and 412 DF,  p-value: 3.212e-08

coef(), which prints the coefficients of the model

## (Intercept)       BFI.E 
##   8.6212501   0.1919857

residuals(), which prints the residuals of the model

##           1           2           3           4           5           6 
## -0.04110684 -1.60508713  2.41481435  0.44286450 -0.79707280  2.24682689

predict(), which generates predicted values for all the observations of X, using the model

##         1         2         3         4         5         6 
## 10.541107 10.805087  9.485186 10.157135 10.997073  9.653173

The tidy solution

The broom package offers a tidy solution to the annoyance of extracting results from lm-class (S3) objects in R.
It has great functions like:

tidy(), which prints a data frame of the coefficients, with the standard errors, t-test statistics associated with the estimates, and p values.

glance(), which summarizes model fit indices, like \(R^2\)

augment(), which adds columns with the predictions, residuals, etc. for each data observation
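A sketch of all three on a stand-in model:

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

tidy(mod)           # one row per term: estimate, std.error, statistic, p.value
glance(mod)         # one-row model summary: r.squared, AIC, etc.
head(augment(mod))  # the data plus .fitted, .resid, and friends
```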

Assumptions

1. No measurement error

  • measurement error influences our ability to control and predict
  • this assumption is never fully met with psychological data
  • the question is how bad it is when we don’t meet this assumption
  • the better you assess your DVs and IVs, the more you can trust your models
  • you can test this by assessing the reliability of your measures using Cronbach’s alpha

Run Cronbach’s alpha - this is actually a really good scale to check. We just need something demonstrative.
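The call that produced the output below appears in its Call line. A runnable sketch using psych’s built-in bfi data, with items E1–E5 standing in for the BFI.E_ items (E1 and E2 are reverse keyed, which check.keys = TRUE detects and corrects):

```r
library(psych)
library(dplyr)

# The bfi data ship with psych. psych::alpha() is called explicitly
# here in case ggplot2::alpha() is masking it, as in the session above.
psych::alpha(bfi %>% select(E1:E5), check.keys = TRUE)
```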

## 
## Reliability analysis   
## Call: psych::alpha(x = pairsW1.tidy %>% select(contains("BFI.E_")))
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean  sd median_r
##        0.9       0.9    0.91      0.53 9.1 0.0074  9.1 2.9     0.53
## 
##  lower alpha upper     95% confidence boundaries
## 0.89 0.9 0.91 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## BFI.E_1      0.88      0.88    0.88      0.51 7.3   0.0090 0.0149  0.50
## BFI.E_3      0.88      0.88    0.89      0.52 7.5   0.0088 0.0160  0.50
## BFI.E_4      0.89      0.89    0.89      0.54 8.2   0.0081 0.0144  0.55
## BFI.E_5      0.90      0.91    0.91      0.58 9.6   0.0071 0.0089  0.56
## BFI.E_8      0.89      0.89    0.89      0.53 8.1   0.0082 0.0139  0.54
## BFI.E_2      0.88      0.89    0.89      0.53 7.8   0.0087 0.0150  0.52
## BFI.E_6      0.89      0.89    0.89      0.54 8.1   0.0084 0.0144  0.54
## BFI.E_7      0.88      0.88    0.88      0.51 7.4   0.0092 0.0146  0.51
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean  sd
## BFI.E_1 413  0.83  0.84  0.82   0.78 10.2 3.6
## BFI.E_3 414  0.81  0.81  0.79   0.75 10.3 3.5
## BFI.E_4 414  0.73  0.75  0.71   0.65 10.3 3.4
## BFI.E_5 414  0.61  0.61  0.51   0.49  9.0 3.9
## BFI.E_8 414  0.74  0.76  0.72   0.67 10.1 3.3
## BFI.E_2 414  0.80  0.79  0.76   0.72  8.1 4.0
## BFI.E_6 413  0.77  0.76  0.72   0.68  6.6 4.0
## BFI.E_7 414  0.85  0.84  0.82   0.78  8.5 4.3

2. Correctly specified form

  • make sure you are measuring what you think you are measuring
  • are you using the correct model?

3. No omitted variables

  • there is an infinite number of potentially relevant variables
  • this is another assumption you will never fully meet
  • basically, look for hidden moderators and third variables

4. Homoscedasticity

  • the conditional variance of the residuals across all levels of X is constant
  • if this is violated, you have heteroscedasticity
  • make sure there is constant error variance among all of the X’s
  • check with a plot of residuals vs. predictors
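A quick check on any lm object (a sketch on stand-in data):

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Residuals vs. fitted values: a roughly even vertical spread across
# the x-axis is consistent with homoscedasticity; a funnel shape is not.
plot(mod, which = 1)
```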

5. Independence among the residuals

  • issues with group, family, longitudinal, and nested designs
  • take care of this with different types of models (e.g., multilevel models)
  • want to see that the residuals don’t look different across time

6. Normally Distributed Residuals

## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:psych':
## 
##     logit

## [1] 167 218
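The two indices above are presumably flagged observations from a QQ plot of the residuals: car::qqPlot() draws one for an lm object and returns the indices of the most extreme points. A sketch on stand-in data:

```r
library(car)

mod <- lm(mpg ~ wt, data = mtcars)

# Draws a QQ plot of the studentized residuals with a confidence
# envelope and returns the indices of the most extreme observations.
extreme <- qqPlot(mod)
extreme
```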